AD-A283  921 



A  Practical  Algorithm  for  Integer  Sorting 
on  a  Mesh-Connected  Computer 

(Preliminary  Version) 

Nathan  Folwell  Sumanta  Guha,  *  Ichiro  Suzuki  * 

Department  of  Computer  Science 
University  of  Wisconsin-Milwaukee 
P.O. Box 784
Milwaukee,  WI  53201 

5,  1994 


Abstract 

This paper presents count-sort, a parallel algorithm for mesh-connected computers that
sorts integers whose range is known in advance. It uses a straightforward counting technique
that has not previously been implemented in parallel sorting algorithms. On a mesh-connected
computer with $\sqrt{N} \times \sqrt{N}$ processors we are able to sort $N$ integers
in the range $1 \ldots \sqrt{N}$ in time $c\sqrt{N}$, where $c$ is very small. For practical
values of $N$, the algorithm is extremely fast. Further, it is possible to expand the range by
a factor $k$ to $1 \ldots k\sqrt{N}$ so that the slowdown is less than $k$.

We present an implementation of count-sort on the SIMD MasPar MP-1 with 8192
processors that sorts 8-bit integers significantly faster than the manufacturer’s current
library routine for sorting 8-bit integers.


1  Introduction 


The  study  of  parallel  algorithms  is  increasingly  becoming  one  of  the  most  important  areas  in 
computer  science.  A  very  practical  and  interesting  architecture  for  parallel  algorithms  is  the 
mesh.  Its  regular  interconnection,  ideal  for  VLSI  implementation,  is  easily  scalable. 

A fundamentally important problem for the mesh-connected architecture is that of finding
efficient sorting algorithms. In fact, sorting is often a key step in other mesh algorithms.
Several practical $O(\sqrt{N})$ time algorithms to sort on a $\sqrt{N} \times \sqrt{N}$ mesh
have been proposed [4, 6, 7, 10]. In the model where there is initially one element per
processor and the target order is snake-like row-major, Schnorr and Shamir [9] developed an
algorithm that runs in time $3\sqrt{N} + o(\sqrt{N})$, which is asymptotically near optimal,
as a provable lower bound is $3\sqrt{N} - o(\sqrt{N})$ [5, 9]. However, their algorithm is only
practical for very large $N$. More recently, Krizanc [3] presented the first deterministic
sorting algorithm in a similar model that overcomes the $3\sqrt{N} - o(\sqrt{N})$ bound, given
that the input is drawn from integers in the range $1 \ldots N$, by using counting techniques.
This is analogous to the situation in the sequential model where, given information about the
range of inputs, it is possible to sort faster than the lower bound of $\Omega(N \log N)$ that
holds for arbitrary inputs [2].

*Communicating author.

We  present  a  parallel  sorting  algorithm,  count-sort,  for  mesh-connected  computers  that 
sorts $N$ integers in the range $1 \ldots k\sqrt{N}$ faster than the above algorithms for practical values of
$k$ and $N$. Count-sort is fast because it is not comparison based. Instead, a counting technique
is  used  to  achieve  high  speeds.  Further,  it  is  practical,  and  we  have,  in  fact,  implemented 
it  as  an  extremely  fast  sorting  routine  on  the  MasPar  MP-1:  on  an  8192  processor  MasPar 
MP-1  our  routine  sorts  8-bit  numbers  30%  faster  than  the  current  8-bit  sorting  routine  in 
the  MasPar  software  library.  Such  routines  to  sort  “short”  integers  have  many  applications. 

In  section  2  we  define  our  models  of  computation.  In  section  3  we  describe  count-sort  and 
analyze  its  running  time.  We  first  develop  the  algorithm  on  a  simple  model  of  computation. 
Next,  we  modify  the  algorithm  for  a  more  powerful  model  that,  in  fact,  better  resembles 
machines  currently  available  on  the  market.  In  section  4  we  present  an  implementation  of 
count-sort  on  the  commercially  available  MasPar  MP-1,  with  time  comparisons  between  our 
implementation  and  the  MasPar  library  sort.  Section  5  presents  conclusions  and  possible 
extensions  to  the  algorithm. 


2  Models  of  Computation 

Here  we  present  two  models  of  computation  for  analyzing  our  algorithm.  The  first  is  a  simple 
model  to  develop  the  algorithm,  while  the  second  has  additional  capabilities. 


2.1  Simple  Model  of  Computation

Assume there are $N$ processors arranged in a $\sqrt{N} \times \sqrt{N}$ mesh. Each processor
is connected to its four nearest neighbors. Processors on the perimeter of the mesh have
wrap-around connections. We identify each processor with a unique ID of the form $(i,j)$,
where $i$ is the row number and $j$ is the column number (see Figure 1). These ID numbers
can also be used to identify processors in row-major order. For example, in Figure 1, the
processor with ID (2,3) is the 7th processor in row-major order for a 4 x 4 mesh.

Each processor can perform simple programming operations and route a single value to
one of its four neighbors in constant time.

The programming operations that are performed are similar to those of any high-level
programming language: the conditional if statement, the assignment statement, and logical and
arithmetic operations are all assumed to execute in time $t_o$.

Define the route command to be SEND_{N,S,E,W}[var]. For example, SEND_N[input]
would send input on every processor to the input register of its neighbor to the north. This
includes the wrap-around connections. Assume the SEND operation executes in time $t_s$.

All  operations  are  performed  simultaneously  in  SIMD  manner  on  all  processors  unless 
specified  by  a  conditional  statement.  If  a  conditional  statement  is  used  then  the  processors 
where  the  conditional  is  true  will  perform  the  operation  while  the  other  processors  are  idle. 
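
The simple model can be mimicked sequentially. The following is a minimal sketch, assuming each register is stored as a list of rows; the function names (`row_major_rank`, `send_s`, `send_e`, `masked_assign`) are hypothetical helpers introduced only for illustration and are not part of any machine's instruction set.

```python
def row_major_rank(i, j, side):
    """1-based row-major rank of processor (i, j) on a side x side mesh."""
    return (i - 1) * side + j


def send_s(reg):
    """SEND_S with wrap-around: row r receives the register of row r - 1."""
    return [reg[-1]] + reg[:-1]


def send_e(reg):
    """SEND_E with wrap-around: column c receives the register of column c - 1."""
    return [[row[-1]] + row[:-1] for row in reg]


def masked_assign(reg, mask, new):
    """SIMD conditional: processors where mask is true take the new value,
    all other processors stay idle and keep their old value."""
    return [[n if m else r for r, m, n in zip(reg_row, mask_row, new_row)]
            for reg_row, mask_row, new_row in zip(reg, mask, new)]
```

For instance, `row_major_rank(2, 3, 4)` reproduces the example of processor (2,3) on a 4 x 4 mesh.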

2.2  A  More  Powerful  Model 

It  is  useful  to  analyze  our  algorithm  on  a  more  powerful  model  that  better  represents  machines 
currently  available  on  the  market.  This  model  has  three  additional  capabilities. 

The first capability, which is available, for example, on the MasPar MP-1, allows for
full permutation routing in constant time. Define this operation to be PERMUTE[var, dest],
which routes the values in var to the processors with ID value dest. Note that dest is a
variable on each processor. This is a powerful operation. It is implemented on the MP-1
with a three-stage hierarchy of crossbar switches, called the router [1, 11]. The time for this
operation is $t_p$.

The second capability, also available on the MasPar MP-1, allows a variable to be sent in
any of the four compass directions an arbitrary number of steps in constant time, provided
the intermediate processors are idle. Define this operation to be SEND[dist]{N,S,E,W}[var].
For example, in Figure 1, SEND[3]S[input] sends the contents of input from row 1 to row 4
in constant time if processors in rows 2 and 3 are idle. The time for this operation is $t_{SD}$.

The third capability is SEND_COPY, which is the same as the more powerful SEND, but
a copy of var is left in processors along its path. For example, SEND_COPY[3]S[input] still
sends input three processors to the south, but each intermediate input register gets a copy of
the original input as well. The time for this operation is $t_c$.

These three capabilities are not unique to the MP-1. Other commercial machines
such as the MasPar MP-2, Cambridge Parallel Processing DAP, and the DEC MPP have
similar capabilities.

3  The  Algorithm


Initially, each processor contains one input integer from the range $1 \ldots \sqrt{N}$. When
the algorithm completes, the inputs are sorted according to row-major order. More formally,
the $(i,j)$th processor will contain the $((i - 1)\sqrt{N} + j)$th smallest element.

The  idea  underlying  count-sort  is  to  use  the  knowledge  that  the  input  range  is  “small” 
to  replace  the  compare-exchange  schemes  of  mesh  sorts  for  arbitrary  input  with  an  efficient 
scheme  to  count  occurrences  of  every  possible  input. 

Each processor has three registers: scratch, count, and output. The register scratch holds
inputs. Both count and output are initialized to 0. See Figure 2 for an example of an initial
configuration on a 4x4 mesh.

Before we proceed we need two definitions. Define Number(i) to be the number of
occurrences of the input value i, and Leader(i) = $\sum_{j=1}^{i}$ Number(j).

For example, if we have the list 2 1 3 2 1 3 1 2 2 then

Number(1) = 3, Number(2) = 4, and Number(3) = 2,

and

Leader(1) = 3, Leader(2) = 7, and Leader(3) = 9.

Notice that if the list above is sorted to 1 1 1 2 2 2 2 3 3, then Leader(i) is the position of
the last occurrence of each i.
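
The two definitions can be checked sequentially on the example list. This is a small sketch rather than part of the mesh algorithm; the function name `number_and_leader` and its 0-based result lists (entry `v - 1` holds the value for input `v`) are choices made here for illustration.

```python
from collections import Counter
from itertools import accumulate


def number_and_leader(values, max_value):
    """Return (number, leader) lists for the input values 1..max_value."""
    counts = Counter(values)
    # number[v - 1] is the number of occurrences of the value v.
    number = [counts.get(v, 0) for v in range(1, max_value + 1)]
    # leader[v - 1] is the running sum Number(1) + ... + Number(v).
    leader = list(accumulate(number))
    return number, leader


number, leader = number_and_leader([2, 1, 3, 2, 1, 3, 1, 2, 2], 3)
print(number)  # [3, 4, 2]
print(leader)  # [3, 7, 9]
```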


3.1  The  Simple  Model 

We describe the five stages of count-sort in the next five subsections, and in the sixth
subsection we give an analysis.


3.1.1  Vertical  Counting 

In this first stage, processor (i,j) counts occurrences of input i in column j. To accomplish
this, use the mesh connections to fully “rotate” the input values around each column in
$\sqrt{N}$ steps (see Figure 3):

    for $\sqrt{N}$ steps do
        if (scratch = i) then count = count + 1
        SEND_S[scratch]

Analysis: $\sqrt{N}$ route steps, $\sqrt{N}$ comparisons, $\sqrt{N}$ assignments, and $\sqrt{N}$
increments, requiring $(\sqrt{N})t_s + (3\sqrt{N})t_o$ time steps.
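
The vertical counting stage can be simulated sequentially. The sketch below assumes the mesh registers are held as lists of rows with 0-based indices, so that row index `i` stands for the input value `i + 1`; the function name `vertical_count` and the sample input are illustrative only.

```python
def vertical_count(inputs):
    """Simulate stage 1: processor (i, j) counts the value i in column j."""
    side = len(inputs)
    scratch = [row[:] for row in inputs]
    count = [[0] * side for _ in range(side)]
    for _ in range(side):
        for i in range(side):
            for j in range(side):
                # 0-based row i is responsible for the input value i + 1.
                if scratch[i][j] == i + 1:
                    count[i][j] += 1
        # SEND_S with wrap-around: row r receives scratch from row r - 1.
        scratch = [scratch[-1]] + scratch[:-1]
    return count


sample = [[1, 2, 3, 4],
          [2, 2, 1, 4],
          [3, 1, 1, 4],
          [4, 2, 3, 1]]
partial_counts = vertical_count(sample)
```

After `side` rotations every processor has seen each value of its column exactly once, so the entries of `partial_counts` sum to the number of inputs.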

3.1.2  Calculating  Number(i)

At  this  point  each  processor  contains  a  partial  count  of  the  input  values.  Clearly,  summing 
the  contents  of  count  across  processors  of  row  i  will  compute  Number(i). 

This is nearly identical to the vertical counting step, but we route horizontally and perform
an unconditional addition between scratch and count (see Figure 4):

    scratch = count
    for $\sqrt{N} - 1$ steps do
        SEND_E[scratch]
        count = count + scratch

Now count in each processor of row i contains Number(i).

Analysis: $\sqrt{N} - 1$ route steps, $\sqrt{N} - 1$ increments, and $\sqrt{N}$ assignments,
requiring $(\sqrt{N} - 1)t_s + (2\sqrt{N} - 1)t_o$ time steps.
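
The horizontal summation can be simulated in the same style. This is a sketch under the assumption that registers are lists of rows; `row_totals` is a hypothetical name, and the input is a table of partial counts such as the vertical counting stage would leave behind.

```python
def row_totals(count):
    """Simulate stage 2: every processor of a row ends with the row sum."""
    side = len(count)
    total = [row[:] for row in count]
    scratch = [row[:] for row in count]
    for _ in range(side - 1):
        # SEND_E with wrap-around: column c receives scratch from column c - 1.
        scratch = [[row[-1]] + row[:-1] for row in scratch]
        # Unconditional addition of the received value on every processor.
        total = [[t + s for t, s in zip(t_row, s_row)]
                 for t_row, s_row in zip(total, scratch)]
    return total


numbers = row_totals([[1, 1, 2, 1],
                      [1, 3, 0, 0],
                      [1, 0, 2, 0],
                      [1, 0, 0, 3]])
```

Only `side - 1` rotations are needed because each processor starts with its own partial count already in place.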

3.1.3  Calculating  Leader(i) 

To calculate Leader(i), we perform a prefix sum down the columns. Specifically, send the
contents of scratch vertically down the mesh, performing additions between scratch and count
at each step. This produces Leader(i) in the contents of count across processors of row i
(see Figure 5):

    scratch = count
    for t = 1 to ($\sqrt{N} - 1$) do
        SEND_S[scratch]
        if (i > t) then count = count + scratch

Here i is the row number of the processor and t is the step number, so the test i > t excludes
values that have wrapped around from the bottom of the mesh.

Analysis: $\sqrt{N} - 1$ route steps, $\sqrt{N} - 1$ increments, $\sqrt{N} - 1$ comparisons, and
$\sqrt{N}$ assignments, requiring $(\sqrt{N} - 1)t_s + (3\sqrt{N} - 2)t_o$ time steps.
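
The column prefix sum can be simulated as follows. This is a sketch, assuming registers are lists of rows and that a processor adds the received value only while the step number is smaller than its 1-based row number, so that values wrapping around from the bottom rows are ignored; the name `leaders` is illustrative.

```python
def leaders(number):
    """Simulate stage 3: row i ends with Number(1) + ... + Number(i)."""
    side = len(number)
    count = [row[:] for row in number]
    scratch = [row[:] for row in number]
    for step in range(1, side):
        # SEND_S with wrap-around: row r receives scratch from row r - 1.
        scratch = [scratch[-1]] + scratch[:-1]
        for i in range(side):
            # The 1-based row number is i + 1; skip wrapped-around values.
            if i + 1 > step:
                for j in range(side):
                    count[i][j] += scratch[i][j]
    return count


leader_table = leaders([[5] * 4, [4] * 4, [3] * 4, [4] * 4])
```

With Number = 5, 4, 3, 4 in rows 1 to 4, the rows of `leader_table` hold 5, 9, 12 and 16.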

3.1.4  Routing  i  to  Processor  Leader(i) 

At this point, we know the value of Leader(i). Now, we shall send the value of i to the
processor with ID Leader(i). To accomplish this we again use the mesh connections to
fully “rotate” the data around the mesh. The modification in this case is that we route
two values: the number i and its value Leader(i). The output registers get the value i.
Notice that it is not necessary to SEND east or west, since every column contains the same
information in processors of the same row (see Figure 6):

    scratch = i
    for $\sqrt{N}$ steps do
        if (count = $(i - 1)\sqrt{N} + j$) then output = scratch
        SEND_S[scratch]
        SEND_S[count]

Analysis: $2\sqrt{N}$ route steps, $\sqrt{N}$ comparisons, $\sqrt{N} + 1$ assignments, and
$\sqrt{N}$ additions, requiring $(2\sqrt{N})t_s + (3\sqrt{N} + 1)t_o$ time steps.

3.1.5  Filling  in  the  Rest 

The final step is to set output for processors that are between processors with ID Leader(i).
This step completes the sorting algorithm (see Figure 7):

    for $\sqrt{N}$ steps do
        if (j ≠ 1) and (output ≠ 0) then SEND_W[output]
    scratch = output
    if (j = 1) and (scratch ≠ 0) then SEND_W[scratch]
    if (j = $\sqrt{N}$) and (scratch ≠ 0) and (i ≠ 1) then SEND_N[scratch]
    if (output = 0) and (scratch ≠ 0) then output = scratch
    if (i ≠ 1) then SEND_N[scratch]
    for $\sqrt{N}$ steps do
        if (j = $\sqrt{N}$) and (output = 0) then
            output = scratch
        SEND [scratch]
    for $\sqrt{N}$ steps do
        if (j ≠ 1) and (output ≠ 0) then SEND_W[output]

Analysis: $3\sqrt{N} + 3$ route steps, $6\sqrt{N} + 8$ comparisons, and $\sqrt{N} + 2$ assignments,
requiring $(3\sqrt{N} + 3)t_s + (7\sqrt{N} + 10)t_o$ time steps.

3.1.6  Final  Analysis 

Summing the times of the five stages, we get the total number of time steps for count-sort:

    $(8\sqrt{N} + 1)t_s + (18\sqrt{N} + 8)t_o$.    (1)

It is possible to improve the running time. Reynolds [8] points out that a slight modification
to the routing stage of our algorithm in Section 3.1.4 will yield a considerable speed-up
as follows.

We use the fact that all processors of row i contain the value of Leader(i) in count
and i in scratch. If the processor position is less than or equal to count then we set output
to scratch, so that we can eliminate the last stage described in Section 3.1.5. The modified
fourth stage is then:

    scratch = i
    for $\sqrt{N}$ steps do
        if (count ≥ $(i - 1)\sqrt{N} + j$) then
            output = scratch
        SEND_S[scratch]
        SEND_S[count]

Analysis: $2\sqrt{N}$ route steps, $\sqrt{N}$ compare steps, $\sqrt{N} + 1$ assignment steps, and
$\sqrt{N}$ additions, requiring $(2\sqrt{N})t_s + (3\sqrt{N} + 1)t_o$ time steps.

After  this  modification,  the  algorithm  is  finished  and  the  fifth  stage  is  not  needed. 
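
The effect of the modified fourth stage can be illustrated with a sequential simulation. The sketch below makes two assumptions that are not stated in the text: processor positions are taken in row-major order, and a processor keeps the smallest value whose Leader reaches its position, so that a larger value arriving later in the rotation does not overwrite the result. The name `route_and_fill` is hypothetical.

```python
def route_and_fill(leader):
    """Rotate (value, Leader(value)) pairs down the columns and let each
    processor keep the smallest value whose Leader reaches its position."""
    side = len(leader)
    scratch = [[i + 1] * side for i in range(side)]  # scratch = i (1-based)
    count = [row[:] for row in leader]
    output = [[0] * side for _ in range(side)]
    for _ in range(side):
        for i in range(side):
            for j in range(side):
                position = i * side + j + 1  # 1-based row-major position
                value = scratch[i][j]
                # Assumption: only a smaller value may replace a set output.
                if count[i][j] >= position and (
                        output[i][j] == 0 or value < output[i][j]):
                    output[i][j] = value
        # SEND_S[scratch] and SEND_S[count], both with wrap-around.
        scratch = [scratch[-1]] + scratch[:-1]
        count = [count[-1]] + count[:-1]
    return output


sorted_mesh = route_and_fill([[5] * 4, [9] * 4, [12] * 4, [16] * 4])
```

Reading `sorted_mesh` row by row gives the 16 inputs in nondecreasing order for Leader values 5, 9, 12 and 16.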

This  improvement  reduces  the  running  time  to: 

    $(5\sqrt{N} - 2)t_s + (11\sqrt{N} - 2)t_o$.    (2)
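
The two totals can be rechecked by adding the per-stage costs quoted in the stage analyses. In this sketch each cost is written as a tuple (a, b, c, d), read as $(a\sqrt{N} + b)t_s + (c\sqrt{N} + d)t_o$; the tuple encoding and the names are conventions chosen here, not notation from the paper.

```python
# Per-stage costs taken from the five "Analysis" paragraphs of Section 3.1.
STAGE_COSTS = {
    "vertical counting": (1, 0, 3, 0),
    "Number": (1, -1, 2, -1),
    "Leader": (1, -1, 3, -2),
    "routing": (2, 0, 3, 1),
    "filling": (3, 3, 7, 10),
}


def total_cost(stages):
    """Add the cost tuples of the named stages component by component."""
    return tuple(sum(STAGE_COSTS[s][k] for s in stages) for k in range(4))


all_five = total_cost(STAGE_COSTS)
without_filling = total_cost([s for s in STAGE_COSTS if s != "filling"])
print(all_five)         # (8, 1, 18, 8), matching equation (1)
print(without_filling)  # (5, -2, 11, -2), matching equation (2)
```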

Compare  the  running  time  of  count-sort  to  the  running  times  of  a  few  existing  practical 
mesh  sorts  for  arbitrary  inputs  that  are  based  on  the  same  SIMD  model: 


Mesh Sort                   Time Steps

Count-sort                  $(5\sqrt{N} - 2)t_s + (11\sqrt{N} - 2)t_o$
Kumar and Hirschberg [4]    $(11\sqrt{N})t_s + (4.5\log^2\sqrt{N})t_o$
Nassimi and Sahni [7]       $(14(\sqrt{N} - 1) - 8\log\sqrt{N})t_s + (6.5\log^2\sqrt{N} + 2.5\log\sqrt{N})t_o$
Thompson and Kung [10]      $(14(\sqrt{N} - 1) - 8\log\sqrt{N})t_s + (2\log^2\sqrt{N} + \log\sqrt{N})t_o$

Table 1: Comparing count-sort with other mesh sorts.

It may be seen that, for sorting in the range $1 \ldots \sqrt{N}$, count-sort is faster for
practical $N$. In fact, if $t_s$ is of the same size as $t_o$ then count-sort is faster than the
other sorts for meshes containing, at least, up to 240 processors, while if $t_s \gg t_o$, which
is usually the case with real machines, count-sort is even faster.

3.2  Adaptation  to  a  More  Powerful  Model

Count-sort  can  be  modified  to  run  even  more  efficiently  on  our  second  model  of  computation 
(see  Section  2.2).  We  examine  each  stage  of  the  above  algorithm  to  see  if  we  are  able  to  take 
advantage  of  the  additional  capabilities. 

3.2.1  Vertical  Counting 

This  stage  remains  the  same. 

Analysis: $(\sqrt{N})t_s + (3\sqrt{N})t_o$ time steps.

3.2.2  Calculating  Number(i)

We can improve this stage by observing that, for this model, we need Number(i) in only one
column, say the first. We compute a prefix sum to the first column in $\frac{1}{2}\log N$ steps as
follows.

We use the enhanced SEND (see Section 2.2), performing additions between processors at
distances that increase by a factor of 2 (see Figure 8) until the prefix sum is computed in
the first column.

    i = 1
    scratch = count
    while i < $\sqrt{N}$ do
        SEND[i]W[scratch]
        count = count + scratch
        i = 2 * i

Analysis: $\frac{1}{2}\log N$ enhanced SEND steps, $\frac{1}{2}\log N$ additions, $\frac{1}{2}\log N$
multiplications, and $(2 + \log N)$ assignments, requiring
$(\frac{1}{2}\log N)t_{SD} + (2\log N + 2)t_o$ time steps.
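
The logarithmic summation along one row can be sketched sequentially. The sketch assumes that in each round only processors whose 0-based column is a multiple of twice the current distance receive a value, and that the value sent is the sender's running sum, refreshed before every round; both points are assumptions made for this illustration. The name `row_sum_log_steps` is hypothetical.

```python
def row_sum_log_steps(row):
    """Sum one mesh row into column 0 using sends of doubling distance."""
    side = len(row)
    count = row[:]
    dist = 1
    rounds = 0
    while dist < side:
        scratch = count[:]  # assumption: senders forward their running sum
        # SEND[dist]W: column j receives from column j + dist; the
        # processors in between neither send nor receive in this round.
        for j in range(0, side, 2 * dist):
            if j + dist < side:
                count[j] += scratch[j + dist]
        dist *= 2
        rounds += 1
    return count[0], rounds


total, rounds = row_sum_log_steps([1, 2, 3, 4, 5, 6, 7, 8])
print(total, rounds)  # 36 3
```

For a row of 8 processors the sum arrives in column 0 after 3 rounds, in line with the $\frac{1}{2}\log N$ step count for $\sqrt{N} = 8$.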

3.2.3  Calculating  Leader(i) 

This  stage  remains  the  same. 

Analysis: $(\sqrt{N} - 1)t_s + (3\sqrt{N} - 2)t_o$ time steps.

3.2.4  Routing  i  to  Processor  Leader(i) 

This  stage  is  improved  by  routing  i  in  a  single  permute  step.  We  know  the  value  of  Leader(i) 
for  all  i.  This  information  is  used  to  send  each  i  to  the  position  Leader(i)  with  the  command: 

PERMUTE[i,Leader(i)]. 

Analysis: 1 full permutation route requiring $t_p$ time steps.

3.2.5  Filling  in  the  Rest 

We can fill in the rest of the output registers with the SEND_COPY command (see
Section 2.2). This is performed in the same manner as in the simple model, but here, instead of
sending variables across the mesh with $3\sqrt{N}$ SEND operations, we replace the latter with 3
SEND_COPY operations.

Analysis: 3 SEND_COPY steps, 14 comparisons, 3 assignments, and 3 SEND steps, requiring
$3t_c + 17t_o + 3t_s$ time steps.

3.2.6  Final  Analysis 

It  follows  that  on  the  improved  model,  the  total  of  time  steps  is: 

    $(2\sqrt{N} + 2)t_s + (6\sqrt{N} + 2\log N + 17)t_o + (\frac{1}{2}\log N)t_{SD} + 3t_c + t_p$    (3)

3.3  Expanding  the  Range 

It is possible to expand the range of integers by a factor k while increasing the running
time by less than a factor k. To expand the range by a factor k we need k extra counter and
scratch registers, following which the overall algorithm remains similar. Details are omitted
in this version.

We achieve a slowdown of less than k by carefully choosing the elements to route and the
comparisons to make. For example, in the vertical counting stage it is possible to count k input
values per processor using no extra route steps: simply route inputs as before, but perform
k comparisons after each route step. This does not increase the number of routes, though
the number of comparisons increases by a factor k. Other stages may be sped up similarly;
observe that the last two stages of the algorithm, in fact, need not be altered at all for
a larger range.
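
The vertical counting example can be sketched for an expanded range. Since the details are omitted in this version, the assignment of values to processors below is an assumption: the processor in 1-based row i is taken to count the k consecutive values (i - 1)k + 1 through ik. The name `vertical_count_k` is hypothetical.

```python
def vertical_count_k(inputs, k):
    """Vertical counting with k counters per processor and no extra routes."""
    side = len(inputs)
    scratch = [row[:] for row in inputs]
    count = [[[0] * k for _ in range(side)] for _ in range(side)]
    for _ in range(side):
        for i in range(side):
            for j in range(side):
                # k comparisons per route step; counter c of 0-based row i
                # is assumed to track the value i * k + c + 1.
                for c in range(k):
                    if scratch[i][j] == i * k + c + 1:
                        count[i][j][c] += 1
        # One SEND_S per step, exactly as in the unexpanded stage.
        scratch = [scratch[-1]] + scratch[:-1]
    return count


counts = vertical_count_k([[1, 4], [3, 1]], k=2)
```

On this 2 x 2 mesh with k = 2 the inputs range over 1 to 4, and the loop still performs only two route steps.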

4  The  Implementation


We implemented count-sort on a MasPar MP-1 with 8192 processors arranged in 64 rows and
128 columns. The implementation closely follows the modified version of the algorithm
presented in Section 3.2. Even though the mesh is not square, the algorithm is essentially
the same.

The implementation was written in MasPar’s Massively Parallel Language, which is an
extended C. It was timed against the current library function psort8u for sorting 8-bit unsigned
integers. In the table below, we give timings in number of clock ticks to sort 8-bit integers on
8192 processors. Inputs were distributed across the mesh using the pseudo-random number
generator p_random. For each range, we ran both routines 1000 times separately as the only
job on the machine and took the average.


Range      psort8u (ticks)    Count-sort (ticks)    % Speed-Up

0...31     31262.634          21568.108             31.0
0...63     31263.178          21644.416             30.8
0...127    31262.898          21727.774             30.5
0...191    31263.582          21779.876             30.3
0...255    31262.986          21800.624             30.3

Table 2: Comparing count-sort with the MasPar library sort.
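
The speed-up column of Table 2 can be recomputed from the two timing columns. The sketch below assumes the percentage is the reduction in clock ticks relative to psort8u; the function name `speed_up_percent` is illustrative.

```python
# (range, psort8u ticks, count-sort ticks) as listed in Table 2.
TIMINGS = [
    ("0...31", 31262.634, 21568.108),
    ("0...63", 31263.178, 21644.416),
    ("0...127", 31262.898, 21727.774),
    ("0...191", 31263.582, 21779.876),
    ("0...255", 31262.986, 21800.624),
]


def speed_up_percent(library_ticks, count_sort_ticks):
    """Percentage of library time saved by count-sort."""
    return 100.0 * (library_ticks - count_sort_ticks) / library_ticks


speed_ups = [round(speed_up_percent(lib, cs), 1) for _, lib, cs in TIMINGS]
print(speed_ups)  # [31.0, 30.8, 30.5, 30.3, 30.3]
```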


5  Conclusions  and  Future  Work 

We have presented a straightforward counting algorithm for sorting integers on a
mesh-connected computer with $\sqrt{N} \times \sqrt{N}$ processors that sorts $N$ integers in the
range $1 \ldots k\sqrt{N}$ in time $c\sqrt{N}$, where $c$ is very small. For practical values of
$k$ and $N$, the algorithm proves to be very fast, both in theory and in implementation.

It  is  possible  that  this  method  can  be  expanded  to  sort  on  a  larger  range.  One  possibility 
is  to  use  count-sort  as  a  component  of  a  parallel  sorting  algorithm  similar  to  sequential  radix 
sort.  Further,  the  counting  techniques  themselves  may  be  useful  in  applications  other  than 
sorting. 

Figure 1: A 4x4 mesh with wrap-around connections.

Figure 2: Initial configuration of scratch, count, and output shown in order.

Figure 3: Configuration of registers after vertical counting step.

Figure 4: Registers after calculating Number(i).

Figure 5: Registers after calculating Leader(i).

Figure 6: Registers after routing i to processor Leader(i).

Figure 7: Registers after final stage.

Figure 8: Logarithmic computation of Number(i). Panels: after adding neighbors; after adding
between every 2nd processor; after adding between every 4th processor.


